Desde el auge tecnológico de secuenciación del genoma humano…
Clasificar la patogenicidad de las mutaciones
Causalidad en enfermedades
24 APRIL 2014
Se utilizan las mutaciones missense
Se utilizan 25 predictores:
## [1] "SIFT_score" "FATHMM_score"
## [3] "PROVEAN_score" "VEST4_score"
## [5] "Polyphen2_HDIV_score" "Polyphen2_HVAR_score"
## [7] "MutationAssessor_score" "LRT_score"
## [9] "fathmm.MKL_coding_score" "fathmm.XF_coding_score"
## [11] "MutPred_score" "PrimateAI_score"
## [13] "integrated_fitCons_score" "GM12878_fitCons_score"
## [15] "H1.hESC_fitCons_score" "HUVEC_fitCons_score"
## [17] "GERP.._RS" "phyloP100way_vertebrate"
## [19] "phyloP30way_mammalian" "phyloP17way_primate"
## [21] "phastCons100way_vertebrate" "phastCons30way_mammalian"
## [23] "phastCons17way_primate" "X29way_logOdds"
## [25] "bStatistic"
UMD TP53 contiene 80406 mutaciones, de las cuales 57885 son mutaciones missense.
Las mutaciones usan la nomenclatura NM_000546.5 (GRCH37, hg19). Utilizando , se transformaron la coordenadas a NC_000017.11 (GRCH38, hg38)
Se tomaron las variaciones únicas y se cruzaron con la base de datos dbNSFP4.0b1a, a diciembre 8 del 2018, para conseguir los 25 predictores (26 variables con REVEL).
Se obtuvieron 1764 variantes únicas luego de este cruce.
Se perdieron 2 variaciones que no eran verdaderas mutaciones missense. La variable respuesta patogenicidad se obtiene de la base de datos UMD.
| Patogenicidad | Frecuencia |
|---|---|
| Benign | 80 (0.1%) |
| Possibly pathogenic | 9,979 (17.2%) |
| Likely Pathogenic | 5,871 (10.1%) |
| Pathogenic | 38,145 (65.9%) |
| VUS | 3,810 (6.6%) |
| Patogenicidad | Frecuencia |
|---|---|
| Benign | 16 (0.9%) |
| Possibly pathogenic | 465 (26.4%) |
| Likely Pathogenic | 90 (5.1%) |
| Pathogenic | 102 (5.8%) |
| VUS | 1,091 (61.8%) |
Variable sccontinuas son resumidas por mediana y rango completo
|
Benign No. 16 |
Possibly pathogenic No. 465 |
Likely Pathogenic No. 90 |
Pathogenic No. 102 |
VUS No. 1,091 |
Total No. 1,764 |
|
|---|---|---|---|---|---|---|
| SIFT | ||||||
| Median (IQR) | 0.26 (0.21 - 0.45) | 0.00 (0.00 - 0.00) | 0.00 (0.00 - 0.00) | 0.00 (0.00 - 0.00) | 0.06 (0.00 - 0.24) | 0.01 (0.00 - 0.12) |
| Missing | 13 (81.25%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 13 (0.74%) |
| FATHMM | ||||||
| Median (IQR) | -5.45 (-5.48 - -5.44) | -6.97 (-7.27 - -6.72) | -7.29 (-7.41 - -6.99) | -7.30 (-7.50 - -7.04) | -6.31 (-6.73 - -5.49) | -6.69 (-7.05 - -5.56) |
| Missing | 13 (81.25%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 13 (0.74%) |
| PROVEAN | ||||||
| Median (IQR) | -0.14 (-0.18 - 0.15) | -4.88 (-6.55 - -2.89) | -6.21 (-7.75 - -3.89) | -6.17 (-8.43 - -3.94) | -1.17 (-3.03 - -0.19) | -2.62 (-5.11 - -0.66) |
| Missing | 13 (81.25%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 13 (0.74%) |
| VEST4 | ||||||
| Median (IQR) | 0.08 (0.08 - 0.12) | 0.82 (0.65 - 0.90) | 0.92 (0.81 - 0.96) | 0.93 (0.87 - 0.96) | 0.36 (0.22 - 0.57) | 0.56 (0.30 - 0.82) |
| Missing | 13 (81.25%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 13 (0.74%) |
| Polyphen2 HDIV | ||||||
| Median (IQR) | 0.01 (0.00 - 0.05) | 1.00 (0.95 - 1.00) | 1.00 (0.99 - 1.00) | 1.00 (0.99 - 1.00) | 0.16 (0.00 - 0.96) | 0.87 (0.02 - 1.00) |
| Missing | 13 (81.25%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 13 (0.74%) |
| Polyphen2 HVAR | ||||||
| Median (IQR) | 0.00 (0.00 - 0.02) | 0.99 (0.84 - 1.00) | 1.00 (0.96 - 1.00) | 0.99 (0.98 - 1.00) | 0.19 (0.01 - 0.88) | 0.77 (0.06 - 0.99) |
| Missing | 13 (81.25%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 13 (0.74%) |
| MutationAssessor | ||||||
| Median (IQR) | 0.97 (0.96 - 1.16) | 2.88 (2.35 - 3.18) | 3.16 (2.90 - 3.27) | 3.20 (2.93 - 3.29) | 1.93 (1.32 - 2.51) | 2.36 (1.67 - 2.98) |
| Missing | 13 (81.25%) | 0 (0%) | 1 (1.11%) | 0 (0%) | 3 (0.27%) | 17 (0.96%) |
| LRT | ||||||
| Median (IQR) | 0.36 (0.18 - 0.37) | 0.00 (0.00 - 0.00) | 0.00 (0.00 - 0.00) | 0.00 (0.00 - 0.00) | 0.00 (0.00 - 0.13) | 0.00 (0.00 - 0.03) |
| Missing | 13 (81.25%) | 0 (0%) | 0 (0%) | 0 (0%) | 10 (0.92%) | 23 (1.30%) |
| fathmm MKL coding | ||||||
| Median (IQR) | 0.05 (0.03 - 0.43) | 0.99 (0.96 - 0.99) | 0.99 (0.98 - 1.00) | 0.99 (0.98 - 0.99) | 0.79 (0.18 - 0.97) | 0.95 (0.40 - 0.99) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| fathmm XF coding | ||||||
| Median (IQR) | 0.06 (0.05 - 0.09) | 0.90 (0.64 - 0.95) | 0.93 (0.81 - 0.96) | 0.94 (0.85 - 0.96) | 0.31 (0.15 - 0.68) | 0.57 (0.21 - 0.91) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| MutPred | ||||||
| Median (IQR) | 0.28 (0.20 - 0.55) | 0.85 (0.73 - 0.92) | 0.91 (0.86 - 0.96) | 0.94 (0.89 - 0.97) | 0.47 (0.26 - 0.70) | 0.68 (0.38 - 0.86) |
| Missing | 4 (25.00%) | 16 (3.44%) | 1 (1.11%) | 5 (4.90%) | 24 (2.20%) | 50 (2.83%) |
| PrimateAI | ||||||
| Median (IQR) | 0.34 (0.30 - 0.37) | 0.60 (0.49 - 0.69) | 0.66 (0.60 - 0.72) | 0.69 (0.59 - 0.74) | 0.44 (0.35 - 0.54) | 0.50 (0.39 - 0.63) |
| Missing | 14 (87.50%) | 4 (0.86%) | 1 (1.11%) | 0 (0%) | 58 (5.32%) | 77 (4.37%) |
| integrated fitCons | ||||||
| Median (IQR) | 0.67 (0.64 - 0.67) | 0.72 (0.72 - 0.72) | 0.72 (0.72 - 0.72) | 0.72 (0.72 - 0.72) | 0.72 (0.72 - 0.72) | 0.72 (0.72 - 0.72) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| GM12878 fitCons | ||||||
| Median (IQR) | 0.70 (0.67 - 0.70) | 0.70 (0.70 - 0.70) | 0.70 (0.70 - 0.70) | 0.70 (0.70 - 0.70) | 0.70 (0.70 - 0.70) | 0.70 (0.70 - 0.70) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| H1.hESC fitCons | ||||||
| Median (IQR) | 0.61 (0.61 - 0.61) | 0.70 (0.70 - 0.72) | 0.70 (0.70 - 0.72) | 0.72 (0.70 - 0.72) | 0.70 (0.70 - 0.72) | 0.70 (0.70 - 0.72) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| HUVEC fitCons | ||||||
| Median (IQR) | 0.66 (0.32 - 0.69) | 0.74 (0.74 - 0.74) | 0.74 (0.74 - 0.74) | 0.74 (0.74 - 0.74) | 0.74 (0.74 - 0.74) | 0.74 (0.74 - 0.74) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| GERP++ RS | ||||||
| Median (IQR) | 0.77 (-0.64 - 1.21) | 4.62 (3.53 - 5.13) | 4.62 (3.64 - 5.28) | 4.62 (3.64 - 5.13) | 2.92 (0.29 - 4.45) | 3.65 (1.40 - 4.71) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| phyloP100way vertebrate | ||||||
| Median (IQR) | -0.04 (-0.33 - 0.46) | 6.94 (2.62 - 8.18) | 7.91 (5.31 - 9.30) | 7.91 (5.30 - 7.91) | 1.14 (0.13 - 3.79) | 2.51 (0.44 - 7.47) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| phyloP30way mammalian | ||||||
| Median (IQR) | 0.13 (-0.41 - 0.73) | 1.14 (1.02 - 1.30) | 1.14 (1.02 - 1.18) | 1.03 (1.02 - 1.14) | 1.00 (0.11 - 1.14) | 1.03 (0.18 - 1.17) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| phyloP17way primate | ||||||
| Median (IQR) | 0.47 (-0.20 - 0.67) | 0.67 (0.60 - 0.75) | 0.67 (0.60 - 0.68) | 0.60 (0.60 - 0.66) | 0.66 (0.35 - 0.67) | 0.66 (0.60 - 0.68) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| phastCons100way vertebrate | ||||||
| Median (IQR) | 0.00 (0.00 - 0.02) | 1.00 (1.00 - 1.00) | 1.00 (1.00 - 1.00) | 1.00 (1.00 - 1.00) | 0.68 (0.00 - 1.00) | 1.00 (0.02 - 1.00) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| phastCons30way mammalian | ||||||
| Median (IQR) | 0.22 (0.06 - 0.66) | 0.97 (0.83 - 1.00) | 0.98 (0.84 - 1.00) | 0.98 (0.87 - 1.00) | 0.72 (0.03 - 0.98) | 0.90 (0.10 - 0.99) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| phastCons17way primate | ||||||
| Median (IQR) | 0.97 (0.86 - 0.99) | 0.97 (0.74 - 1.00) | 0.98 (0.75 - 1.00) | 0.98 (0.84 - 1.00) | 0.77 (0.21 - 0.99) | 0.91 (0.55 - 0.99) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| X29way logOdds | ||||||
| Median (IQR) | 4.24 (3.38 - 4.87) | 12.31 (9.50 - 13.83) | 12.94 (10.68 - 15.36) | 13.59 (10.88 - 15.66) | 8.18 (5.20 - 11.98) | 9.98 (6.46 - 13.06) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 7 (0.64%) | 7 (0.40%) |
| bStatistic | ||||||
| Median (IQR) | 439.00 (436.00 - 439.00) | 433.00 (433.00 - 434.00) | 433.00 (433.00 - 434.00) | 433.00 (432.00 - 434.00) | 434.00 (433.00 - 434.00) | 434.00 (433.00 - 434.00) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| REVEL | ||||||
| Median (IQR) | 0.28 (0.26 - 0.35) | 0.87 (0.74 - 0.93) | 0.91 (0.87 - 0.94) | 0.92 (0.88 - 0.95) | 0.52 (0.38 - 0.72) | 0.69 (0.46 - 0.88) |
| Missing | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
|
Total No. 673 |
Complete No. 632 |
Incomplete No. 41 |
|
|---|---|---|---|
| Pathogenicity | |||
| Benign | 16 (2.4%) | 0 (0.0%) | 16 (100.0%) |
| Possibly pathogenic | 465 (69.1%) | 447 (96.1%) | 18 (3.9%) |
| Likely Pathogenic | 90 (13.4%) | 88 (97.8%) | 2 (2.2%) |
| Pathogenic | 102 (15.2%) | 97 (95.1%) | 5 (4.9%) |
| Variable | Faltantes | % |
|---|---|---|
| MutPred_score | 26 | 3.86 |
| PrimateAI_score | 19 | 2.82 |
| MutationAssessor_score | 14 | 2.08 |
| SIFT_score | 13 | 1.93 |
| FATHMM_score | 13 | 1.93 |
| PROVEAN_score | 13 | 1.93 |
| VEST4_score | 13 | 1.93 |
| Polyphen2_HDIV_score | 13 | 1.93 |
| Polyphen2_HVAR_score | 13 | 1.93 |
| LRT_score | 13 | 1.93 |
| cDNA_variant | HG19_variant | HG38_variant | position | ref | alt | aaref | aaalt | Ensembl_transcriptid | Ensembl_proteinid | Uniprot_acc | Uniprot_entry | Codon | WT_AA_1 | WT_AA_3 | Mutant_AA_1 | Mutant_AA_3 | SIFT_score | FATHMM_score | PROVEAN_score | VEST4_score | Polyphen2_HDIV_score | Polyphen2_HVAR_score | MutationAssessor_score | LRT_score | fathmm.MKL_coding_score | fathmm.XF_coding_score | MutPred_score | PrimateAI_score | integrated_fitCons_score | GM12878_fitCons_score | H1.hESC_fitCons_score | HUVEC_fitCons_score | GERP.._RS | phyloP100way_vertebrate | phyloP30way_mammalian | phyloP17way_primate | phastCons100way_vertebrate | phastCons30way_mammalian | phastCons17way_primate | X29way_logOdds | bStatistic | REVEL_score | Pathogenicity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NM_000546.5:c.139C>T | NC_000017.10:g.7579548G>A | NC_000017.11:g.7676230G>A | 7676230 | G | A | P | S | ENST00000269305 | ENSP00000269305 | P04637 | P53_HUMAN | 47 | P | Pro | S | Ser | 0.157 | -5.51 | 0.44 | 0.156 | 0.009 | 0.004 | 0.975 | 0.000006 | 0.01973 | 0.079916 | NA | 0.2708584 | 0.722319 | 0.702456 | 0.698795 | 0.735409 | -7.900 | -3.332 | -0.212 | -1.044 | 0.000 | 0.000 | 0.001 | 1.8607 | 434 | 0.320 | Benign |
| NM_000546.5:c.215C>G | NC_000017.10:g.7579472G>C | NC_000017.11:g.7676154G>C | 7676154 | G | C | P | R | ENST00000269305 | ENSP00000269305 | P04637 | P53_HUMAN | 72 | P | Pro | R | Arg | 0.258 | -5.45 | -0.23 | 0.083 | 0.083 | 0.045 | 1.355 | 0.370853 | 0.36143 | 0.186080 | NA | 0.4004812 | 0.722319 | 0.702456 | 0.698795 | 0.735409 | 1.870 | 1.438 | 1.141 | 0.672 | 0.003 | 0.001 | 0.001 | 9.7733 | 434 | 0.368 | Benign |
| NM_000546.5:c.91G>A | NC_000017.10:g.7579705C>T | NC_000017.11:g.7676387C>T | 7676387 | C | T | V | I | ENST00000269305 | ENSP00000269305 | P04637 | P53_HUMAN | 31 | V | Val | I | Ile | 0.635 | -5.43 | -0.14 | 0.082 | 0.001 | 0.002 | 0.935 | 0.362815 | 0.01403 | 0.066351 | NA | NA | 0.722319 | 0.702456 | 0.698795 | 0.735409 | -4.850 | -0.909 | -0.690 | -0.197 | 0.000 | 0.319 | 0.969 | 7.7560 | 434 | 0.569 | Benign |
| NM_000546.5:c.993+217G>C | NC_000017.10:g.7576636C>G | NC_000017.11:g.7673318C>G | NA | NA | NA | E | Q | NA | NA | NA | NA | 339-beta | E | Glu | Q | Gln | NA | NA | NA | NA | NA | NA | NA | NA | 0.69465 | 0.052691 | 0.071 | NA | 0.671770 | 0.702456 | 0.596874 | 0.692231 | 1.600 | 0.431 | -0.524 | -0.229 | 0.074 | 0.132 | 0.969 | 6.6486 | 436 | 0.325 | Benign |
| NM_000546.5:c.993+218A>G | NC_000017.10:g.7576635T>C | NC_000017.11:g.7673317T>C | NA | NA | NA | E | G | NA | NA | NA | NA | 339-beta | E | Glu | G | Gly | NA | NA | NA | NA | NA | NA | NA | NA | 0.61753 | 0.054952 | 0.078 | NA | 0.671770 | 0.702456 | 0.596874 | 0.692231 | 1.510 | 0.093 | -0.368 | 0.661 | 0.071 | 0.205 | 0.978 | 4.3668 | 436 | 0.254 | Benign |
| NM_000546.5:c.993+220A>G | NC_000017.10:g.7576633T>C | NC_000017.11:g.7673315T>C | NA | NA | NA | N | D | NA | NA | NA | NA | 340-beta | N | Asn | D | Asp | NA | NA | NA | NA | NA | NA | NA | NA | 0.67094 | 0.074269 | 0.075 | NA | 0.671770 | 0.702456 | 0.596874 | 0.692231 | 1.200 | 0.561 | 0.879 | 0.661 | 0.063 | 0.717 | 0.984 | 4.4007 | 436 | 0.269 | Benign |
| NM_000546.5:c.993+223T>G | NC_000017.10:g.7576630A>C | NC_000017.11:g.7673312A>C | NA | NA | NA | C | G | NA | NA | NA | NA | 341-beta | C | Cys | G | Gly | NA | NA | NA | NA | NA | NA | NA | NA | 0.85353 | 0.135663 | 0.279 | NA | 0.671770 | 0.702456 | 0.596874 | 0.692231 | 1.260 | 0.611 | 1.011 | 0.751 | 0.036 | 0.928 | 0.991 | 4.1073 | 436 | 0.228 | Benign |
| NM_000546.5:c.993+281T>G | NC_000017.10:g.7576572A>C | NC_000017.11:g.7673254A>C | NA | NA | NA | L | V | NA | NA | NA | NA | 336-gamma | L | Leu | V | Val | NA | NA | NA | NA | NA | NA | NA | NA | 0.11713 | 0.059946 | 0.581 | NA | 0.562547 | 0.577304 | 0.608884 | 0.322989 | 0.893 | 0.337 | 0.106 | -0.104 | 0.002 | 0.693 | 0.989 | 4.0172 | 439 | 0.248 | Benign |
| NM_000546.5:c.993+283A>C | NC_000017.10:g.7576570T>G | NC_000017.11:g.7673252T>G | NA | NA | NA | L | F | NA | NA | NA | NA | 336-gamma | L | Leu | F | Phe | NA | NA | NA | NA | NA | NA | NA | NA | 0.04800 | 0.058936 | 0.544 | NA | 0.562547 | 0.577304 | 0.608884 | 0.322989 | -0.991 | 0.059 | -0.054 | 0.497 | 0.001 | 0.657 | 0.994 | 3.1993 | 439 | 0.286 | Benign |
| NM_000546.5:c.993+285G>A | NC_000017.10:g.7576568C>T | NC_000017.11:g.7673250C>T | NA | NA | NA | R | Q | NA | NA | NA | NA | 337-gamma | R | Arg | Q | Gln | NA | NA | NA | NA | NA | NA | NA | NA | 0.06029 | 0.057955 | 0.602 | NA | 0.562547 | 0.577304 | 0.608884 | 0.322989 | 0.766 | -0.135 | 0.686 | 0.450 | 0.001 | 0.660 | 0.996 | 4.7998 | 439 | 0.349 | Benign |
| NM_000546.5:c.993+285G>C | NC_000017.10:g.7576568C>G | NC_000017.11:g.7673250C>G | NA | NA | NA | R | P | NA | NA | NA | NA | 337-gamma | R | Arg | P | Pro | NA | NA | NA | NA | NA | NA | NA | NA | 0.08690 | 0.131111 | 0.656 | NA | 0.562547 | 0.577304 | 0.608884 | 0.322989 | 0.766 | -0.135 | 0.686 | 0.450 | 0.001 | 0.660 | 0.996 | 4.7998 | 439 | 0.361 | Benign |
| NM_000546.5:c.993+309C>T | NC_000017.10:g.7576544G>A | NC_000017.11:g.7673226G>A | NA | NA | NA | S | L | NA | NA | NA | NA | 345-gamma | S | Ser | L | Leu | NA | NA | NA | NA | NA | NA | NA | NA | 0.03685 | 0.041398 | 0.291 | NA | 0.671770 | 0.702456 | 0.608884 | 0.322989 | 0.893 | 0.882 | 0.953 | 0.672 | 0.009 | 0.242 | 0.900 | 5.0959 | 439 | 0.261 | Benign |
| NM_000546.5:c.993+311T>C | NC_000017.10:g.7576542A>G | NC_000017.11:g.7673224A>G | NA | NA | NA | S | P | NA | NA | NA | NA | 346-gamma | S | Ser | P | Pro | NA | NA | NA | NA | NA | NA | NA | NA | 0.02266 | 0.092732 | 0.257 | NA | 0.671770 | 0.702456 | 0.608884 | 0.322989 | -0.636 | -0.237 | 0.149 | 0.751 | 0.001 | 0.058 | 0.858 | 3.2310 | 439 | 0.255 | Benign |
| NM_000546.5:c.993+311T>G | NC_000017.10:g.7576542A>C | NC_000017.11:g.7673224A>C | NA | NA | NA | S | A | NA | NA | NA | NA | 346-gamma | S | Ser | A | Ala | NA | NA | NA | NA | NA | NA | NA | NA | 0.03689 | 0.050213 | 0.234 | NA | 0.671770 | 0.702456 | 0.608884 | 0.322989 | -0.636 | -0.237 | 0.149 | 0.751 | 0.001 | 0.058 | 0.858 | 3.2310 | 439 | 0.265 | Benign |
| NM_000546.5:c.993+312C>G | NC_000017.10:g.7576541G>C | NC_000017.11:g.7673223G>C | NA | NA | NA | S | W | NA | NA | NA | NA | 346-gamma | S | Ser | W | Trp | NA | NA | NA | NA | NA | NA | NA | NA | 0.04284 | 0.094164 | 0.438 | NA | 0.671770 | 0.702456 | 0.608884 | 0.662433 | -0.166 | -0.597 | -0.919 | -0.727 | 0.001 | 0.059 | 0.851 | 3.4316 | 439 | 0.387 | Benign |
| NM_000546.5:c.993+312C>T | NC_000017.10:g.7576541G>A | NC_000017.11:g.7673223G>A | NA | NA | NA | S | L | NA | NA | NA | NA | 346-gamma | S | Ser | L | Leu | NA | NA | NA | NA | NA | NA | NA | NA | 0.02373 | 0.043392 | NA | NA | 0.671770 | 0.702456 | 0.608884 | 0.662433 | -0.166 | -0.597 | -0.919 | -0.727 | 0.001 | 0.059 | 0.851 | 3.4316 | 439 | 0.268 | Benign |
Clases inseparables
|
Total No. 1,764 |
(0,0.5] No. 542 |
(0.5,1] No. 1,222 |
|
|---|---|---|---|
| Pathogenicity | |||
| Benign | 16 (0.9%) | 15 (93.8%) | 1 (6.2%) |
| Possibly pathogenic | 465 (26.4%) | 18 (3.9%) | 447 (96.1%) |
| Likely Pathogenic | 90 (5.1%) | 0 (0.0%) | 90 (100.0%) |
| Pathogenic | 102 (5.8%) | 0 (0.0%) | 102 (100.0%) |
| VUS | 1,091 (61.8%) | 509 (46.7%) | 582 (53.3%) |
set.seed(17)
data_model_imputed <- rfImpute(
data = data_model,
Pathogenicity ~ .,
iter = 20,
ntree = 10001
)
## ntree OOB 1 2 3 4
## 10001: 28.38% 12.50% 6.67% 95.56% 70.59%
## ntree OOB 1 2 3 4
## 10001: 28.23% 6.25% 6.67% 95.56% 70.59%
## ntree OOB 1 2 3 4
## 10001: 28.23% 6.25% 6.67% 95.56% 70.59%
## ntree OOB 1 2 3 4
## 10001: 28.23% 6.25% 6.45% 95.56% 71.57%
## ntree OOB 1 2 3 4
## 10001: 28.23% 6.25% 6.45% 95.56% 71.57%
## ntree OOB 1 2 3 4
## 10001: 28.23% 6.25% 6.67% 95.56% 70.59%
## ntree OOB 1 2 3 4
## 10001: 28.53% 6.25% 7.10% 95.56% 70.59%
## ntree OOB 1 2 3 4
## 10001: 27.93% 6.25% 6.24% 95.56% 70.59%
## ntree OOB 1 2 3 4
## 10001: 28.53% 6.25% 6.88% 95.56% 71.57%
## ntree OOB 1 2 3 4
## 10001: 28.38% 6.25% 6.67% 95.56% 71.57%
## ntree OOB 1 2 3 4
## 10001: 28.38% 6.25% 6.67% 95.56% 71.57%
## ntree OOB 1 2 3 4
## 10001: 28.23% 6.25% 6.67% 95.56% 70.59%
## ntree OOB 1 2 3 4
## 10001: 28.38% 6.25% 6.88% 95.56% 70.59%
## ntree OOB 1 2 3 4
## 10001: 28.23% 6.25% 6.45% 95.56% 71.57%
## ntree OOB 1 2 3 4
## 10001: 28.53% 6.25% 6.88% 95.56% 71.57%
## ntree OOB 1 2 3 4
## 10001: 28.38% 6.25% 6.67% 95.56% 71.57%
## ntree OOB 1 2 3 4
## 10001: 28.38% 6.25% 6.88% 95.56% 70.59%
## ntree OOB 1 2 3 4
## 10001: 28.23% 6.25% 6.67% 95.56% 70.59%
## ntree OOB 1 2 3 4
## 10001: 28.23% 6.25% 6.45% 95.56% 71.57%
## ntree OOB 1 2 3 4
## 10001: 28.53% 6.25% 6.88% 95.56% 71.57%
rf_task <- makeClassifTask(
data = data_model_imputed,
target = "Pathogenicity"
)
rf_task
## Supervised task: data_model_imputed
## Type: classif
## Target: Pathogenicity
## Observations: 673
## Features:
## numerics factors ordered functionals
## 25 0 0 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 4
## Benign Possibly pathogenic Likely Pathogenic
## 16 465 90
## Pathogenic
## 102
## Positive class: NA
estimateTimeTuneRanger(rf_task, num.trees = 10001)
## Approximated time for tuning: 16M 58S
listMeasures(rf_task)
## [1] "featperc" "mmce" "lsr"
## [4] "bac" "qsr" "timeboth"
## [7] "multiclass.aunp" "timetrain" "multiclass.aunu"
## [10] "ber" "timepredict" "multiclass.brier"
## [13] "ssr" "acc" "logloss"
## [16] "wkappa" "multiclass.au1p" "multiclass.au1u"
## [19] "kappa"
set.seed(17)
res <- tuneRanger(
rf_task,
num.trees = 10001,
iters = 70,
iters.warmup = 30,
measure = list(bac),
# tune.parameters = c(
# "mtry",
# "min.node.size",
# "sample.fraction"
# ),
parameters = list(
importance = "permutation",
splitrule = "gini",
replace = F
# probability = T,
),
show.info = F
)
res
## Recommended parameter settings:
## mtry min.node.size sample.fraction
## 1 17 6 0.4309226
## Results:
## bac exec.time
## 1 0.5728293 8.364444
res$model$learner.model
## Ranger result
##
## Call:
## ranger::ranger(formula = NULL, dependent.variable = tn, data = getTaskData(.task, .subset), probability = (.learner$predict.type == "prob"), case.weights = .weights, ...)
##
## Type: Probability estimation
## Number of trees: 10001
## Sample size: 673
## Number of independent variables: 25
## Mtry: 17
## Target node size: 6
## Variable importance mode: permutation
## Splitrule: gini
## OOB prediction error (Brier s.): 0.2359279
set.seed(17)
# Con probabilidad
model <- ranger(
data = data_model_imputed,
Pathogenicity ~ .,
num.trees = 10001,
mtry = res$recommended.pars$mtry,
min.node.size = res$recommended.pars$min.node.size,
replace=F,
sample.fraction = res$recommended.pars$sample.fraction,
importance = res$model$learner.model$importance.mode,
splitrule = res$model$learner.model$splitrule,
keep.inbag = TRUE,
probability = T
)
model
## Ranger result
##
## Call:
## ranger(data = data_model_imputed, Pathogenicity ~ ., num.trees = 10001, mtry = res$recommended.pars$mtry, min.node.size = res$recommended.pars$min.node.size, replace = F, sample.fraction = res$recommended.pars$sample.fraction, importance = res$model$learner.model$importance.mode, splitrule = res$model$learner.model$splitrule, keep.inbag = TRUE, probability = T)
##
## Type: Probability estimation
## Number of trees: 10001
## Sample size: 673
## Number of independent variables: 25
## Mtry: 17
## Target node size: 6
## Variable importance mode: permutation
## Splitrule: gini
## OOB prediction error (Brier s.): 0.2360206
set.seed(17)
# Sin probabilidad
model2 <- ranger(
data = data_model_imputed,
Pathogenicity ~ .,
num.trees = 10001,
mtry = res$recommended.pars$mtry,
min.node.size = res$recommended.pars$min.node.size,
# replace=F,
sample.fraction = res$recommended.pars$sample.fraction,
importance = res$model$learner.model$importance.mode,
splitrule = res$model$learner.model$splitrule,
keep.inbag = TRUE
# probability = T
)
OOBCurve(
model2,
data= data_model_imputed,
task = rf_task,
measures = list(bac)
) -> oobcurve
oobcurve %>%
add_column(num.trees=1:nrow(.)) %>%
ggplot(aes(x=num.trees, y=bac))+
geom_line()+
ylim(.2,.8)+
scale_x_continuous(breaks = seq(0,10001,2000))+
xlab("#Trees")+
ylab("Balanced accuracy")+
ggtitle("Target metric convergence")+
theme(
# axis.text.x = element_blank(),
# axis.ticks.x = element_blank(),
# strip.text = element_text(size = 15),
plot.title = element_text(size = 24, face = "bold"),
axis.title=element_text(size=20,face="bold"),
text = element_text(size=20)
)
ggsave(
"figures/article_figures/Figure 1.pdf",
dpi = 600,
height = 12,
width = 18.75,
# units = "inches"
)
model$predictions %>% head() # OOB
## Benign Possibly pathogenic Likely Pathogenic Pathogenic
## [1,] 0.000000e+00 0.1655445 0.37631839 0.45813710
## [2,] 0.000000e+00 0.3749095 0.22951097 0.39557948
## [3,] 0.000000e+00 0.3955345 0.26600260 0.33846290
## [4,] 0.000000e+00 0.6323163 0.21921811 0.14846561
## [5,] 2.897543e-05 0.7137720 0.08177156 0.20442745
## [6,] 0.000000e+00 0.9280949 0.05204490 0.01986019
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Possibly pathogenic Likely Pathogenic
## Benign 15 0 0
## Possibly pathogenic 1 439 72
## Likely Pathogenic 0 8 5
## Pathogenic 0 18 13
## Reference
## Prediction Pathogenic
## Benign 0
## Possibly pathogenic 63
## Likely Pathogenic 4
## Pathogenic 35
##
## Overall Statistics
##
## Accuracy : 0.734
## 95% CI : (0.6989, 0.7671)
## No Information Rate : 0.6909
## P-Value [Acc > NIR] : 0.008098
##
## Kappa : 0.3196
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Benign Class: Possibly pathogenic
## Sensitivity 0.93750 0.9441
## Specificity 1.00000 0.3462
## Pos Pred Value 1.00000 0.7635
## Neg Pred Value 0.99848 0.7347
## Prevalence 0.02377 0.6909
## Detection Rate 0.02229 0.6523
## Detection Prevalence 0.02229 0.8544
## Balanced Accuracy 0.96875 0.6451
## Class: Likely Pathogenic Class: Pathogenic
## Sensitivity 0.055556 0.34314
## Specificity 0.979417 0.94571
## Pos Pred Value 0.294118 0.53030
## Neg Pred Value 0.870427 0.88962
## Prevalence 0.133730 0.15156
## Detection Rate 0.007429 0.05201
## Detection Prevalence 0.025260 0.09807
## Balanced Accuracy 0.517486 0.64442
Ligera diferencia con probability = F
t(model2$confusion.matrix) %>%
confusionMatrix()
## Confusion Matrix and Statistics
##
## true
## predicted Benign Possibly pathogenic Likely Pathogenic
## Benign 15 0 0
## Possibly pathogenic 1 441 74
## Likely Pathogenic 0 7 3
## Pathogenic 0 17 13
## true
## predicted Pathogenic
## Benign 0
## Possibly pathogenic 66
## Likely Pathogenic 3
## Pathogenic 33
##
## Overall Statistics
##
## Accuracy : 0.7311
## 95% CI : (0.6958, 0.7642)
## No Information Rate : 0.6909
## P-Value [Acc > NIR] : 0.01277
##
## Kappa : 0.3018
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Benign Class: Possibly pathogenic
## Sensitivity 0.93750 0.9484
## Specificity 1.00000 0.3221
## Pos Pred Value 1.00000 0.7577
## Neg Pred Value 0.99848 0.7363
## Prevalence 0.02377 0.6909
## Detection Rate 0.02229 0.6553
## Detection Prevalence 0.02229 0.8648
## Balanced Accuracy 0.96875 0.6353
## Class: Likely Pathogenic Class: Pathogenic
## Sensitivity 0.033333 0.32353
## Specificity 0.982847 0.94746
## Pos Pred Value 0.230769 0.52381
## Neg Pred Value 0.868182 0.88689
## Prevalence 0.133730 0.15156
## Detection Rate 0.004458 0.04903
## Detection Prevalence 0.019316 0.09361
## Balanced Accuracy 0.508090 0.63550
prob_estimate <- predict(model, data = data_model_imputed)$predictions
class_prediction <- colnames(prob_estimate)[apply(prob_estimate,1,which.max)]
class_prediction <- ordered(
class_prediction,
levels = c("Benign", "Possibly pathogenic", "Likely Pathogenic", "Pathogenic")
)
confusionMatrix(
class_prediction,
data_model_imputed$Pathogenicity
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Possibly pathogenic Likely Pathogenic
## Benign 15 0 0
## Possibly pathogenic 1 461 39
## Likely Pathogenic 0 1 45
## Pathogenic 0 3 6
## Reference
## Prediction Pathogenic
## Benign 0
## Possibly pathogenic 29
## Likely Pathogenic 1
## Pathogenic 72
##
## Overall Statistics
##
## Accuracy : 0.8811
## 95% CI : (0.8542, 0.9046)
## No Information Rate : 0.6909
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7221
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Benign Class: Possibly pathogenic
## Sensitivity 0.93750 0.9914
## Specificity 1.00000 0.6683
## Pos Pred Value 1.00000 0.8698
## Neg Pred Value 0.99848 0.9720
## Prevalence 0.02377 0.6909
## Detection Rate 0.02229 0.6850
## Detection Prevalence 0.02229 0.7875
## Balanced Accuracy 0.96875 0.8298
## Class: Likely Pathogenic Class: Pathogenic
## Sensitivity 0.50000 0.7059
## Specificity 0.99657 0.9842
## Pos Pred Value 0.95745 0.8889
## Neg Pred Value 0.92812 0.9493
## Prevalence 0.13373 0.1516
## Detection Rate 0.06686 0.1070
## Detection Prevalence 0.06984 0.1204
## Balanced Accuracy 0.74828 0.8451
| Benign | Possibly pathogenic | Likely Pathogenic | Pathogenic |
|---|---|---|---|
| 0 | 0.1655 | 0.3763 | 0.4581 |
| 0 | 0.3749 | 0.2295 | 0.3956 |
| 0 | 0.3955 | 0.2660 | 0.3385 |
| 0 | 0.6323 | 0.2192 | 0.1485 |
| 0 | 0.7138 | 0.0818 | 0.2044 |
| 0 | 0.9281 | 0.0520 | 0.0199 |
model$predictions %>%
# predict(model, data=data_model_imputed)$predictions %>%
data.frame() %>%
add_column(pathogenicity = data_model_imputed$Pathogenicity) %>%
data.frame() -> df
df %>% gather("category","prob", Benign:Pathogenic) %>%
data.frame %>%
mutate(
category = ordered(
category,
levels = c(
"Benign",
"Possibly.pathogenic",
"Likely.Pathogenic",
"Pathogenic"),
labels = c(
"Benign",
"Possibly pathogenic",
"Likely Pathogenic",
"Pathogenic")
)
) %>%
ggplot(aes(x = category, y = prob, fill=pathogenicity)) +
geom_boxplot() +
scale_fill_manual(
values=c(
"Green",
"Yellow",
"Orange",
"Red"
)
)+
facet_grid(.~ pathogenicity) +
ggtitle("Observed pathogenicity")+
xlab("Class membership")+
ylab("Probability")+
theme(
# axis.text.x = element_blank(),
# axis.ticks.x = element_blank(),
# strip.text = element_text(size = 15),
plot.title = element_text(size = 24, face = "bold"),
axis.title=element_text(size=20,face="bold"),
text = element_text(size=20),
axis.text.x = element_text(angle = 90, vjust = .3, hjust = .95),
legend.position = "none"
)
ggsave(
"figures/supplementary_figures/Supp. Figure S4.pdf",
dpi = 600,
height = 12,
width = 18.75,
# units = "inches"
)
df %>% gather("category","prob", Benign:Pathogenic) %>%
data.frame %>%
mutate(
category = ordered(
category,
levels = c(
"Benign",
"Possibly.pathogenic",
"Likely.Pathogenic",
"Pathogenic")
)) %>%
ggplot(aes(x = category, y = prob, color = pathogenicity)) +
geom_point(position = "jitter") +
theme(axis.text.x = element_text(angle = 90, vjust = .3, hjust=.95))+
scale_color_manual(values = c(
"Benign" = "green",
"Possibly pathogenic" = "yellow",
"Likely Pathogenic" = "orange",
"Pathogenic" = "red")
)+
xlab("Class membership")
df1 <- data.frame(
ID=1:nrow(data_model_imputed),
observed=if_else(
data_model_imputed$Pathogenicity == "Benign",
1,
0
),
rf_prob = model$predictions[,1]
)
df2 <- data.frame(
ID=1:nrow(data_model_imputed),
observed=if_else(
data_model_imputed$Pathogenicity == "Possibly pathogenic",
1,
0
),
rf_prob = model$predictions[,2]
)
df3 <- data.frame(
ID=1:nrow(data_model_imputed),
observed=if_else(
data_model_imputed$Pathogenicity == "Likely Pathogenic",
1,
0
),
rf_prob = model$predictions[,3]
)
df4 <- data.frame(
ID=1:nrow(data_model_imputed),
observed=if_else(
data_model_imputed$Pathogenicity == "Pathogenic",
1,
0
),
rf_prob = model$predictions[,4]
)
roc1 <- roc(
df1$observed,
df1$rf_prob
)
roc2 <- roc(
df2$observed,
df2$rf_prob
)
roc3 <- roc(
df3$observed,
df3$rf_prob
)
roc4 <- roc(
df4$observed,
df4$rf_prob
)
# coords(
# roc4,
# .90, intput = "sens",
# ret=c("threshold", "specificity", "sensitivity"),
# , transpose = FALSE
# )
thresholds <- data.frame(
thresholds = roc4$thresholds,
sensitivities = roc4$sensitivities,
specificities = roc4$specificities
)
thresholds %>%
filter(
sensitivities >= .9
) %>%
arrange(
sensitivities,
-specificities
) %>%
top_n(1) -> req_cut_off
ggroc(
list(
roc1,
roc2,
roc3,
roc4
),
size = 1.2
)+
theme_minimal() +
labs(
title = "OOB ROC plot",
# x = "Specificity",
y = "Sensibility"
) +
geom_segment(
aes(x = 1, xend = 0, y = 0, yend = 1),
color="grey",
linetype="dashed"
) +
# geom_vline(
# xintercept = req_cut_off$specificities,
# color="grey",
# linetype="dotted",
# size = 1.5
# ) +
# geom_hline(
# yintercept = req_cut_off$sensitivities,
# color="grey",
# linetype="dotted",
# size = 1.5
# ) +
geom_point(
aes(
x = req_cut_off$specificities[1],
y = req_cut_off$sensitivities[1],
color="red",
),
size=20,
shape = 10
)+
scale_color_manual(
name = "AUC",
labels = c(
paste("Benign:",round(roc1$auc,3)),
paste("Possibly pathogenic:",round(roc2$auc,3)),
paste("Likely Pathogenic:",round(roc3$auc,3)),
paste("Pathogenic:",round(roc4$auc,3)),
paste("Threshold:",round(req_cut_off$thresholds,3))
),
values=c("green", "yellow", "orange", "red", "red"),
guide = guide_legend(
override.aes = list(shape = c(rep(NA, 4), 10)))
)+
theme(
# axis.text.x = element_blank(),
# axis.ticks.x = element_blank(),
# strip.text = element_text(size = 15),
plot.title = element_text(size = 34, face = "bold"),
axis.title=element_text(size=20,face="bold"),
text = element_text(size=20),
legend.text=element_text(size=32),
legend.position = c(0.7, 0.2)
)
ggsave(
"figures/article_figures/Figure 3.pdf",
dpi = 600,
height = 12,
width = 18.75,
# units = "inches"
)
df1 <- data.frame(
ID=1:nrow(data_model_imputed),
observed=if_else(
data_model_imputed$Pathogenicity == "Benign",
1,
0
),
rf_prob = predict(model, data = data_model_imputed)$predictions[,1]
)
df2 <- data.frame(
ID=1:nrow(data_model_imputed),
observed=if_else(
data_model_imputed$Pathogenicity == "Possibly pathogenic",
1,
0
),
rf_prob = predict(model, data = data_model_imputed)$predictions[,2]
)
df3 <- data.frame(
ID=1:nrow(data_model_imputed),
observed=if_else(
data_model_imputed$Pathogenicity == "Likely Pathogenic",
1,
0
),
rf_prob = predict(model, data = data_model_imputed)$predictions[,3]
)
df4 <- data.frame(
ID=1:nrow(data_model_imputed),
observed=if_else(
data_model_imputed$Pathogenicity == "Pathogenic",
1,
0
),
rf_prob = predict(model, data = data_model_imputed)$predictions[,4]
)
roc1 <- roc(
df1$observed,
df1$rf_prob
)
roc2 <- roc(
df2$observed,
df2$rf_prob
)
roc3 <- roc(
df3$observed,
df3$rf_prob
)
roc4 <- roc(
df4$observed,
df4$rf_prob
)
ggroc(
list(
roc1,
roc2,
roc3,
roc4
)
)+
theme_minimal() +
labs(
title = "In-sample ROC plot",
x = "Specificity",
y = "Sensibility"
) +
geom_segment(
aes(x = 1, xend = 0, y = 0, yend = 1),
color="grey",
linetype="dashed"
) +
theme(legend.position = c(0.7, 0.2))+
scale_color_discrete(
name = "AUC",
labels = c(
paste("Benign:",round(roc1$auc,4)),
paste("Possibly pathogenic:",round(roc2$auc,4)),
paste("Likely Pathogenic:",round(roc3$auc,4)),
paste("Pathogenic:",round(roc4$auc,4))
)
)
PathogenicPara una sensibilidad requerida del 90% el punto de corte sería 0.07
Estimación del error fuera de muestra optimista
model$predictions %>% summary()
## Benign Possibly pathogenic Likely Pathogenic
## Min. :0.0000000 Min. :0.01313 Min. :0.002577
## 1st Qu.:0.0000000 1st Qu.:0.54143 1st Qu.:0.060108
## Median :0.0000000 Median :0.74496 Median :0.117175
## Mean :0.0216053 Mean :0.67915 Mean :0.142992
## 3rd Qu.:0.0000931 3rd Qu.:0.86763 3rd Qu.:0.204811
## Max. :0.9753581 Max. :0.98100 Max. :0.592558
## Pathogenic
## Min. :0.0008993
## 1st Qu.:0.0462966
## Median :0.0967736
## Mean :0.1562559
## 3rd Qu.:0.2085338
## Max. :0.8375264
model$predictions %>%
data.frame() %>%
filter(Pathogenic > 0.07) %>%
nrow()
## [1] 415
| Predicción | Frecuencia |
|---|---|
| Benign | 15 (2.2%) |
| Possibly pathogenic | 575 (85.4%) |
| Likely Pathogenic | 17 (2.5%) |
| Pathogenic | 66 (9.8%) |
| Predicción | Frecuencia |
|---|---|
| Benign | 15 (2.2%) |
| Possibly pathogenic | 244 (36.3%) |
| Likely Pathogenic | 0 (0.0%) |
| Pathogenic | 414 (61.5%) |
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Possibly pathogenic Likely Pathogenic
## Benign 15 0 0
## Possibly pathogenic 1 212 21
## Likely Pathogenic 0 0 0
## Pathogenic 0 253 69
## Reference
## Prediction Pathogenic
## Benign 0
## Possibly pathogenic 10
## Likely Pathogenic 0
## Pathogenic 92
##
## Overall Statistics
##
## Accuracy : 0.474
## 95% CI : (0.4357, 0.5125)
## No Information Rate : 0.6909
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1978
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Benign Class: Possibly pathogenic
## Sensitivity 0.93750 0.4559
## Specificity 1.00000 0.8462
## Pos Pred Value 1.00000 0.8689
## Neg Pred Value 0.99848 0.4103
## Prevalence 0.02377 0.6909
## Detection Rate 0.02229 0.3150
## Detection Prevalence 0.02229 0.3626
## Balanced Accuracy 0.96875 0.6510
## Class: Likely Pathogenic Class: Pathogenic
## Sensitivity 0.0000 0.9020
## Specificity 1.0000 0.4361
## Pos Pred Value NaN 0.2222
## Neg Pred Value 0.8663 0.9614
## Prevalence 0.1337 0.1516
## Detection Rate 0.0000 0.1367
## Detection Prevalence 0.0000 0.6152
## Balanced Accuracy 0.5000 0.6690
coordinates %>%
t() %>%
data.frame() %>%
ggplot(aes(x=threshold, y = value, color = variable)) +
geom_point(aes(y = ppv, col = "ppv")) +
geom_point(aes(y = sensitivity, col = "sensitivity"))+
scale_x_continuous(breaks = seq(0,1,.1))+
scale_y_continuous(breaks = seq(0,1,.1))
| threshold | ppv | sensitivity |
|---|---|---|
| 0.6634399 | 0.6666667 | 0.0980392 |
coordinates %>%
t() %>%
data.frame() %>%
ggplot(aes(x=threshold, y = value, color = variable)) +
geom_point(aes(y = ppv, col = "ppv")) +
geom_point(aes(y = sensitivity, col = "sensitivity"))+
geom_point(
aes(
x = max_ppv$threshold,
y = max_ppv$ppv,
col = "red"
),
shape = 10,
size=20
)+
scale_x_continuous(breaks = seq(0,1,.1))+
scale_y_continuous(breaks = seq(0,1,.1))+
ggtitle("PPV and sensitivity vs OOB probability of membership to pathogenic class")+
scale_color_manual(
name = "",
labels = c(
paste0("Cut-off: ", round(max_ppv$threshold,2)),
paste0("PPV: ", round(max_ppv$ppv,2)),
paste0("Sensitivity: ", round(max_ppv$sensitivity,2))
),
values=c("indianred", "indianred", "steelblue"),
guide = guide_legend(
override.aes = list(
shape = c(10,20,20)
# size = 5
)
)
)+
theme(
plot.title = element_text(size = 34, face = "bold"),
axis.title=element_text(size=20,face="bold"),
text = element_text(size=20),
legend.text=element_text(size=28)
# legend.position = c(0.7, 0.2)
)
ggsave(
"figures/article_figures/Figure 4.pdf",
dpi = 600,
height = 12,
width = 18.75,
# units = "inches"
)
| Predicción | Frecuencia |
|---|---|
| Benign | 15 (2.2%) |
| Possibly pathogenic | 605 (89.9%) |
| Likely Pathogenic | 38 (5.6%) |
| Pathogenic | 15 (2.2%) |
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benign Possibly pathogenic Likely Pathogenic
## Benign 15 0 0
## Possibly pathogenic 1 449 78
## Likely Pathogenic 0 14 9
## Pathogenic 0 2 3
## Reference
## Prediction Pathogenic
## Benign 0
## Possibly pathogenic 77
## Likely Pathogenic 15
## Pathogenic 10
##
## Overall Statistics
##
## Accuracy : 0.7177
## 95% CI : (0.682, 0.7514)
## No Information Rate : 0.6909
## P-Value [Acc > NIR] : 0.0713
##
## Kappa : 0.2316
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Benign Class: Possibly pathogenic
## Sensitivity 0.93750 0.9656
## Specificity 1.00000 0.2500
## Pos Pred Value 1.00000 0.7421
## Neg Pred Value 0.99848 0.7647
## Prevalence 0.02377 0.6909
## Detection Rate 0.02229 0.6672
## Detection Prevalence 0.02229 0.8990
## Balanced Accuracy 0.96875 0.6078
## Class: Likely Pathogenic Class: Pathogenic
## Sensitivity 0.10000 0.09804
## Specificity 0.95026 0.99124
## Pos Pred Value 0.23684 0.66667
## Neg Pred Value 0.87244 0.86018
## Prevalence 0.13373 0.15156
## Detection Rate 0.01337 0.01486
## Detection Prevalence 0.05646 0.02229
## Balanced Accuracy 0.52513 0.54464
La idea es crear una sola función que devuelva las probabilidades estimadas de pertenencia de grupo y la predicción de la clase patogénica del RF normal (la clase con mayor probabilidad), TP53MiPaPred_sens y TP53MiPaPred_ppv. Esta función se aplica en nuevos conjuntos de datos (ej. VUS). El resultado de esta función es un data.frame. De los datos nuevos newdata se predice solo sobre los casos completos para los 25 predictores.
TP53MiPaPred <- function(newdata){
# Probabilidades
prediction <- predict(model, data = na.omit(newdata))$predictions
# Clase predicha con el RF normal
prediction %>%
as.tibble() %>%
mutate(
default_rf = colnames(.)[apply(.,1,which.max)],
default_rf = ordered(default_rf,
levels=c("Benign",
"Possibly pathogenic",
"Likely Pathogenic",
"Pathogenic")
)
) -> prediction
# Clase predicha con TP53MiPaPred_sens
prediction$sens_model <- ifelse(
prediction$default_rf == "Benign", "Benign",
ifelse(
prediction[,4] > 0.07, "Pathogenic",
colnames(prediction[,2:3])[apply(prediction[,2:3],1,which.max)]
)
)
prediction$sens_model <- ordered(
prediction$sens_model,
levels=c("Benign",
"Possibly pathogenic",
"Likely Pathogenic",
"Pathogenic")
)
# Clase predicha con TP53MiPaPred_ppv
prediction$ppv_model <- ifelse(
prediction$default_rf == "Benign", "Benign",
ifelse(
prediction[,4] > max_ppv$threshold, "Pathogenic",
colnames(prediction[,2:3])[apply(prediction[,2:3],1,which.max)]
)
)
prediction$ppv_model <- ordered(
prediction$ppv_model,
levels=c("Benign",
"Possibly pathogenic",
"Likely Pathogenic",
"Pathogenic")
)
return(prediction)
}
De las 1091 variantes VUS 994 (91%) son casos completos (sin datos faltantes)
| Variable | Faltantes | % |
|---|---|---|
| PrimateAI_score | 58 | 5.32 |
| MutPred_score | 24 | 2.20 |
| LRT_score | 10 | 0.92 |
| X29way_logOdds | 7 | 0.64 |
| MutationAssessor_score | 3 | 0.27 |
vus_pred <- TP53MiPaPred(na.omit(vus_variants))
summary(vus_pred)
## Benign Possibly pathogenic Likely Pathogenic
## Min. :0.0000000 Min. :0.1121 Min. :0.0008932
## 1st Qu.:0.0002487 1st Qu.:0.6772 1st Qu.:0.0321051
## Median :0.0088325 Median :0.8452 Median :0.0628154
## Mean :0.0809039 Mean :0.7705 Mean :0.0942466
## 3rd Qu.:0.1050374 3rd Qu.:0.9236 3rd Qu.:0.1158526
## Max. :0.6631587 Max. :0.9932 Max. :0.4047845
## Pathogenic default_rf sens_model
## Min. :0.002988 Benign : 45 Benign : 45
## 1st Qu.:0.022034 Possibly pathogenic:939 Possibly pathogenic:723
## Median :0.040404 Likely Pathogenic : 7 Likely Pathogenic : 0
## Mean :0.054380 Pathogenic : 3 Pathogenic :226
## 3rd Qu.:0.068714
## Max. :0.457839
## ppv_model
## Benign : 45
## Possibly pathogenic:942
## Likely Pathogenic : 7
## Pathogenic : 0
##
##
vus_pred %>%
pull(default_rf) %>%
describeFactors() %>%
data.frame() %>%
rownames_to_column() %>%
remove_rownames() %>%
kable(col.names = c("Predicción", "Frecuencia"), format = "html") %>%
kable_styling()
| Predicción | Frecuencia |
|---|---|
| Benign | 45 (4.5%) |
| Possibly pathogenic | 939 (94.5%) |
| Likely Pathogenic | 7 (0.7%) |
| Pathogenic | 3 (0.3%) |
vus_pred %>%
pull(sens_model) %>%
describeFactors() %>%
data.frame() %>%
rownames_to_column() %>%
remove_rownames() %>%
kable(col.names = c("Predicción", "Frecuencia"), format = "html") %>%
kable_styling()
| Predicción | Frecuencia |
|---|---|
| Benign | 45 (4.5%) |
| Possibly pathogenic | 723 (72.7%) |
| Likely Pathogenic | 0 (0.0%) |
| Pathogenic | 226 (22.7%) |
vus_pred %>%
select(Benign:Pathogenic,sens_model) %>%
gather("category","prob", Benign:Pathogenic) %>%
ggplot(aes(x = category, y = prob)) +
geom_boxplot() +
facet_grid(.~ sens_model) +
theme(axis.text.x = element_text(angle = 90, vjust = .5))+
xlab("Membership probability")
vus_pred %>%
pull(ppv_model) %>%
describeFactors() %>%
data.frame() %>%
rownames_to_column() %>%
remove_rownames() %>%
kable(col.names = c("Predicción", "Frecuencia"), format = "html") %>%
kable_styling()
| Predicción | Frecuencia |
|---|---|
| Benign | 45 (4.5%) |
| Possibly pathogenic | 942 (94.8%) |
| Likely Pathogenic | 7 (0.7%) |
| Pathogenic | 0 (0.0%) |
vus_pred %>%
select(Benign:Pathogenic,ppv_model) %>%
gather("category","prob", Benign:Pathogenic) %>%
ggplot(aes(x = category, y = prob)) +
geom_boxplot() +
facet_grid(.~ ppv_model) +
theme(axis.text.x = element_text(angle = 90, vjust = .5))+
xlab("Membership probability")
data %>%
filter(Pathogenicity!="VUS") %>%
select(cDNA_variant:Mutant_AA_3) %>%
cbind(data_model_imputed) %>%
openxlsx::write.xlsx("data_model_imputed.xlsx")
Para predecir sobre nuevos datos está disopnible el modelo implementado con el paquete ranger y la función de predicción expuesta arriba.
model$predictions contiene las predicciontes OOB sobre los datos imputados.
save(list = c("TP53MiPaPred","model"), file = "load_to_predict.RData")
# require(ranger)
# load("load_to_predict.RData")
Sobre las variantes VUS con predictores completos
data %>%
filter(Pathogenicity=="VUS") %>%
select(cDNA_variant:bStatistic) %>%
na.omit() -> vus_variants_na_omit
data.frame(
vus_variants_na_omit,
TP53MiPaPred(vus_variants_na_omit)
) %>%
openxlsx::write.xlsx("vus_predictions.xlsx")